Before we begin, let's load the packages that we may use. For this assignment, we will be wrangling part of the gapminder dataset.

library(gapminder)
library(tidyverse)
library(dplyr)
library(forcats)
library(ggplot2)
library(tibble)
library(DT)
library(plotly)
library(gridExtra)
library(scales)

Exercise 1

The here package, and its main function here::here(), is most useful in RStudio projects when sharing code with collaborators. Paths that are hard-coded in other ways can break when collaborators use different operating systems. Using here::here() also removes the need to set the working directory in the code (for example with setwd()), which would fail for any collaborator who does not have the same file path on their computer. Other advantages include the ease with which sub-directories can be managed, and the fact that paths still resolve when files are opened outside of an RStudio project.
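As a minimal sketch of the idea, the call below builds a path relative to the project root instead of relying on the working directory. The folder and file names are the ones used later in this report; the object name csv_path is only for illustration, and the printed path will differ on every machine.

```r
library(here)

# here() locates the project root (for example, the folder holding the
# .Rproj file) and appends the remaining arguments as path components.
csv_path <- here("hw05", "Poland.csv")

# The leading part of the path differs between computers,
# but the trailing "hw05/Poland.csv" part stays the same.
csv_path
```

Because the path is built from its components, no separator or absolute location has to be typed by hand.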

Exercise 2

Part 1

Let’s explore the gapminder dataset. Before beginning, let’s verify that a variable within the dataset, such as continent, is a factor.

gapminder$continent %>%
  class()
## [1] "factor"

The variable continent is indeed a factor.

Let’s look at how many levels there are within this factor, and let’s identify the different levels.

gapminder$continent %>%
  nlevels()
## [1] 5
gapminder$continent %>%
  levels()
## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"

There are 5 levels, which are listed above; the number of entries for each continent can be found in the table below.

gapminder %>%
  count(continent) %>%
  datatable()

There are relatively few entries for Oceania, so let’s remove Oceania from the dataset. First, let’s filter through the dataset to remove observations from Oceania.

Ex2 <- c("Africa", "Americas", "Asia", "Europe")
Ex2_filter <- gapminder %>%
  filter(continent %in% Ex2)

Ex2_filter %>%
  count(continent) %>%
  datatable()
Ex2_filter$continent %>%
  nlevels()
## [1] 5
Ex2_filter$continent %>%
  levels()
## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"

Oceania was successfully filtered from the dataset! However, we can see that the number of levels didn’t change, and Oceania is still listed as a level; Oceania is now considered an unused level. Let’s drop this unused level.

Ex2_dropped <- Ex2_filter %>% 
  droplevels()

Ex2_dropped$continent %>%
  nlevels()
## [1] 4
Ex2_dropped$continent %>%
  levels()
## [1] "Africa"   "Americas" "Asia"     "Europe"

We successfully dropped the unused levels, and Oceania is no longer listed as a level.
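Note that droplevels() acts on every factor in the data frame. As a hedged alternative sketch, forcats::fct_drop() drops unused levels from a single factor, which may be preferable when other factors should keep theirs. The object name no_oceania is hypothetical and not part of the analysis above.

```r
library(gapminder)
library(dplyr)
library(forcats)

# Remove the Oceania rows, then drop unused levels from continent only.
no_oceania <- gapminder %>%
  filter(continent != "Oceania") %>%
  mutate(continent = fct_drop(continent))

nlevels(no_oceania$continent)  # continent loses its unused level
nlevels(no_oceania$country)    # country still carries all original levels
```

With droplevels(), the country factor would also lose the levels for the Oceania countries; which behaviour is wanted depends on the analysis.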

Part 2

Now that we have removed Oceania, let’s compute the standard deviation of life expectancy for each remaining continent and display the spread of the data using boxplots.

Ex2_P2 <- Ex2_dropped %>%
  group_by(continent) %>%
  summarize(sigma = sd(lifeExp))

Ex2_P2_r <- Ex2_P2
Ex2_P2_r$sigma <- round(Ex2_P2_r$sigma, 1)

datatable(Ex2_P2_r)
Ex2_dropped %>%
  select(continent, lifeExp) %>%
  arrange(continent) %>%
  ggplot(aes(continent, lifeExp)) +
    geom_boxplot() +
    ylab("Life expectancy") +
    xlab("Continent") +
    theme_bw()

Let’s reorder the boxplots based on the standard deviation.

Ex2_dropped %>%
  select(continent, lifeExp) %>%
  arrange(continent) %>%
  ggplot(aes(fct_relevel(continent, "Asia", "Americas", "Africa", "Europe"), lifeExp)) +
    geom_boxplot() +
    ylab("Life expectancy") +
    xlab("Continent") +
    theme_bw()

Now the boxes become tighter from left to right!
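The order above was typed by hand after reading the standard deviation table. As a sketch of a less manual approach, forcats::fct_reorder() can compute the order from the data, so the plot stays correct if the data change. The object names no_oceania and by_spread are hypothetical.

```r
library(gapminder)
library(dplyr)
library(forcats)
library(ggplot2)

no_oceania <- gapminder %>%
  filter(continent != "Oceania") %>%
  droplevels()

# Reorder continent by the standard deviation of lifeExp, largest first.
by_spread <- no_oceania %>%
  mutate(continent = fct_reorder(continent, lifeExp, .fun = sd, .desc = TRUE))

# Should match the manual order passed to fct_relevel() above.
levels(by_spread$continent)

ggplot(by_spread, aes(continent, lifeExp)) +
  geom_boxplot() +
  ylab("Life expectancy") +
  xlab("Continent") +
  theme_bw()
```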

Exercise 3

Let’s create a dataset with only data from Poland. Only data about the year, life expectancy, population, and GDP per capita will be kept since the country and continent will be redundant.

Poland <- gapminder %>%
  filter(country == "Poland") %>%
  select(year, lifeExp, pop, gdpPercap)

write_csv(Poland, here::here("hw05", "Poland.csv"))

Now that we have created a CSV file in the hw05 folder of our project, let’s read that file back in so we can see the data!

Poland_file <- here::here("hw05", "Poland.csv")

Poland_new <- read_csv(Poland_file)
datatable(Poland_new)

The CSV file was successfully reloaded and survived the “round trip” of writing to file and reading back in that same file! Let’s play around with factor levels. First, let’s identify which variables are factors.

Poland$year %>%
  class()
## [1] "integer"
Poland$lifeExp %>%
  class()
## [1] "numeric"
Poland$pop %>%
  class()
## [1] "integer"
Poland$gdpPercap %>%
  class()
## [1] "numeric"

There are no factors in this dataset, which makes sense because it contains no categorical data; the categorical variables from the original dataset were removed because they are redundant here (the variables country and continent would always be Poland and Europe, respectively).
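The class checks above can be done for all columns at once, and the same check can be applied to the data read back from disk. The sketch below is an assumption-laden illustration rather than part of the original analysis: it writes to a temporary file instead of the hw05 folder, uses the hypothetical names poland and poland_back, and relies on the show_col_types argument available in recent versions of readr.

```r
library(gapminder)
library(dplyr)
library(readr)

poland <- gapminder %>%
  filter(country == "Poland") %>%
  select(year, lifeExp, pop, gdpPercap)

# Round trip through a temporary CSV file.
tmp <- tempfile(fileext = ".csv")
write_csv(poland, tmp)
poland_back <- read_csv(tmp, show_col_types = FALSE)

# Column classes before and after the round trip.
# read_csv() guesses double for whole-number columns, so year and pop
# are expected to come back as "numeric" rather than "integer".
sapply(poland, class)
sapply(poland_back, class)

# Confirm that no column is a factor after reading the file back in.
any(sapply(poland_back, is.factor))
```

The values survive the round trip, but column types are guessed again on reading, which is worth keeping in mind when a dataset does contain factors.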

Exercise 4

Many figures were created while completing previous assignments. However, in light of recent developments, let’s go back to Assignment 3 and recreate the figure from Task 3, showing the original figure and the new figure side by side.

Ex3.3_o <- gapminder %>%
  group_by(continent) %>%
  select(continent, gdpPercap) %>%
  arrange(continent) %>%
  ggplot(aes(continent, gdpPercap)) +
    geom_boxplot() +
    ylab("GDP Per Capita") +
    scale_y_continuous(labels = dollar_format()) +
    xlab("Continent") +
    theme_bw()

Ex3.3_n <- gapminder %>%
  group_by(continent) %>%
  select(continent, gdpPercap) %>%
  arrange(continent) %>%
  ggplot(aes(continent, gdpPercap)) +
    geom_boxplot() +
    scale_y_continuous(labels = dollar_format()) +
    ggtitle("GDP per capita by continent") +
    xlab("") +
    ylab("") +
    theme_bw() +
    theme(panel.border = element_blank(), 
          panel.grid.major = element_blank(), 
          panel.grid.minor = element_blank(), 
          axis.line = element_line(colour = "grey"))

grid.arrange(Ex3.3_o, Ex3.3_n, nrow = 1)

There are some notable differences in the second figure. It includes a title that names both variables, which makes separate axis labels redundant, so they have been removed. The distracting gridlines (both major and minor) have been removed; the top and right panel borders have been removed, and the remaining axis lines have been lightened to grey for minimal distraction. Now, for even more improvement, let’s make the second figure interactive so that all the values are easy to see!

Ex3.3_n2 <- ggplotly(Ex3.3_n)

Ex3.3_n2

Exercise 5

Let’s save the improved plot that we just made in Exercise 4 (the figure on the right side) as a separate file.

ggsave("gdp_by_continent.png", plot = Ex3.3_n)
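Passing the plot explicitly matters here: without the plot argument, ggsave() saves the last plot that was displayed, which may not be the one intended. As a hedged sketch, the call below also fixes the size and resolution so the saved file does not depend on the current graphics device. The object names p and out_file, the temporary output location, and the chosen dimensions are assumptions for illustration only.

```r
library(gapminder)
library(ggplot2)

# A simplified stand-in for the improved figure from Exercise 4.
p <- ggplot(gapminder, aes(continent, gdpPercap)) +
  geom_boxplot() +
  ggtitle("GDP per capita by continent") +
  theme_bw()

# Save to a temporary folder with explicit dimensions (inches) and resolution.
out_file <- file.path(tempdir(), "gdp_by_continent.png")
ggsave(out_file, plot = p, width = 6, height = 4, dpi = 300)

file.exists(out_file)
```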

Now let’s embed that saved figure in the report below!